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Principal components analysis (PCA) is a classical method for the reduction of dimensionality of 
QQ I data in the form of n observations (or cases) of a vector with p variables. Contemporary data sets 

CNJ ' often have p comparable to, or even much larger than n. Our main assertions, in such settings, are 

(a) that some initial reduction in dimensionality is desirable before applying any PCA-type search 
^ I ' for principal modes, and (b) the initial reduction in dimensionality is best achieved by working in 

I a basis in which the signals have a sparse representation. We describe a simple asymptotic model 

in which the estimate of the leading principal component vector via standard PCA is consistent if 
and only if p{n)/n 0. We provide a simple algorithm for selecting a subset of coordinates with 
' largest sample variances, and show that if PCA is done on the selected subset, then consistency is 

recovered, even if p(n) 3> n. 

Our main setting is that of signals and images, in which the number of sampling points, or 
pixels, is often comparable with or larger than the number of cases, n. Our particular example here 
^ ' is the electrocardiogram (ECG) signal of the beating heart, but similar approaches have been used, 

04 I say, for PCA on libraries of face images. 

0^ ■ Standard PCA involves an 0(min(p'^, n^)) search for directions of maximum variance. But if we 

have some a priori way of selecting k <^ min(n,p) coordinates in which most of the variation among 
cases is to be found, then the complexity of PCA is much reduced, to 0{k^). This is a computational 
reason, but if there is instrumental or other observational noise in each case that is uncorrelated 
with or independent of relevant case-to-case variation, then there is another compelling reason to 
^*) ' \ preselect a small subset of variables before running PCA. 

Indeed, we construct a model of factor analysis type and show that ordinary PCA can produce 
a consistent (as n — > oo) estimate of the principal factor if and only if p{n) is asymptotically of 
smaller order than n. Heuristically, if p{n) > cn, there is so much observational noise and so many 
^ I dimensions over which to search, that a spurious noise maximum will always drown out the true 

factor. 

Fortunately, it is often reasonable to expect such small subsets of variables to exist: Much recent 
research in signal and image analysis has sought orthonormal basis and related systems in which 
typical signals have sparse representations: most co-ordinates have small signal energies. If such a 
basis is used to represent a signal - we use wavelets as the classical example here - then the variation 
in many coordinates is likely to be very small. 

Consequently, we study a simple "sparse PCA" algorithm with the following ingredients: a) 
given a suitable orthobasis, compute coefficients for each case, b) compute sample variances (over 
cases) for each coordinate in the basis, and select the k coordinates of largest sample variance, c) 
run standard PCA on the selected k coordinates, obtaining up to k estimated eigenvectors, d) if 
desired, use soft or hard thresholding to denoise these estimated eigenvectors, and e) re-express the 
(denoised) sparse PCA eigenvector estimates in the original signal domain. 

We illustrate the algorithm on some exercise ECG data, and also develop theory to show in 
a single factor model, under an appropriate sparsity assumption, that it indeed overcomes the 
inconsistency problems when p{n) > cn, and yields consistent estimates of the principal factor. 
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1 Introduction 



Suppose {xi, z = 1, . . . , n} is a dataset of n observations on p variables. Standard principal 
components analysis (PC A) looks for vectors ^ that maximize 

Var(C^x.)/||Cf. (1) 

If ^1, . . . , ^fc have already been found by this optimization, then the maximum defining S,k+i 
is taken over vectors ^ orthogonal to ^i, . . . , ^fc. 

Our interest lies in situations in which each Xi IS cl realization of a possibly high di- 
mensional signal, so that p is comparable in magnitude to n, or may even be larger. In 
addition, we have in mind settings in which the signals Xi contain localized features, so that 
the principal modes of variation sought by PCA may well be localized also. 

Consider, for example, the sample of an electrocardiogram (ECG) in Figure [1] showing 
some 13 consecutive heart beat cycles as recorded by one of the standard ECG electrodes. 
Individual beats are notable for features such as the sharp spike ("QRS complex") and the 
subsequent lower peak ("T wave"), shown schematically in the second panel. The presence 
of these local features, of differing spatial scales, suggests the use of wavelet bases for 
efficient representation. Traditional ECG analysis focuses on averages of a series of beats. 
If one were to look instead at beat to beat variation, one might expect these local features 
to play a significant role in the principal component eigenvectors. 
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Figure 1: (a) Sample of thirteen beats from one electrode of an electrocardiogram taken 
in the laboratory of Victor Froelicher, MD, Palo Alt o VA. (b) Cartoo n of the key features 
of the cardiac cycle reflected in the ECG trace, from iHamptonl ()1997l ). 



Returning to the general situation, the main contentions of this paper are: 

(a) that when p is comparable to n, some reduction in dimensionality is desirable before 
applying any PCA-type search for principal modes, and 

(b) the reduction in dimensionality is best achieved by working in a basis in which the 
signals have a sparse representation. 

We will support these assertions with arguments based on statistical performance and 
computational cost. 
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We begin, however, with an illustration of our results on a simple constructed ex- 
ample. Consider a single component (or single factor) model, in which, when viewed as 
p— dimensional column vectors 

Xi = Vip + azi, i = l,...,n (2) 

in which p G is the single component to be estimated, Vi ~ N{0, 1) are i.i.d. Gaussian 
random effects and Zi ~ Np{0,I) are independent p— dimensional noise vectors. 

True PC, p = 2048, n = 1024, ||p || = 10. a sample path from model (2) another sample path from model (2) 




500 1000 1500 2000 500 1000 1500 2000 500 1000 1500 2000 



standard PCA smoothed RCA ASPCA, w = 99.5%, k = 372. 
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Figure 2: True principal component, the "3-peak" curve. Panel (a): the single component pi = 
f{l/n) where f{t) = C{0.7S(1500, 3000) + 0.55(1200, 900) + 0.55(600, 160)} and B(a,b)[t) = 
[r(a + 6)/(r(a)r(6))]t"-i(l - tf-^ denotes the Beta density on [0, 1]. Panels (b,c): Two sample 
paths drawn from model (2) with a = 1. n = 1024 replications in total, p — 2048. (d): Sample 
principal component by standard PCA. (e): Sample principal component by smoothed PCA using 
A — 10^^^ and A = 10~^. (f): Sample principal component by sparse PCA with weighting function 
w = 99.5%, k = 372. 



Panel (a) of Figure [2] shows an example of p with p = 2048 and the vector pi = f{l/n) 
where f{t) is a mixture of Beta densities on [0, 1], scaled so that \\p\\ = i^^pfY^"^ = 10. 
Panels (b) and (c) show two sample paths from model ([2]): the random effect Vip is hard to 
discern individual cases. Panel (d) shows the result of standard PCA applied to n = 1024 
observations from ([2]) with a = 1. The effect of the noise remains clearly visible in the 
estimated principal eigenvector. 
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For functional data o f thi s type, a re g ulariz ed appro ach to PCA has been propose d by 
Rice &: Silyerman (1991) and Silyerman ( 19961 ). see also Ramsay Sz SilyermanI ( 199?! ) and 



references therein. While smoothing can be incorpora ted in various ways, we illustrate the 
method discussed also in Ramsay Sz SilyermanI ( 1997 . Ch. 7), which replaces ([1]) with 



\ar{ex.)/[Uf + Mm\\' 



(3) 



where -D^^ is the {p — 2) x 1 vector of second differences of ^ and A G (0, oo) is the 
regularization parameter. 

Panel (e) shows the estimated first principal component vector found by maximizing 
([3|) with A = 10~^^ and A = 10~^ respectively. Neither is really satisfactory as an estimate: 
the first recovers the original peak heights, but fails fully to suppress the remaining baseline 
noise, while the second grossly oversmooths the peaks in an effort to remove all trace of noise. 
Further investigation with other choices of A confirms the impression already conveyed here: 
no single choice of A succeeds both in preserving peak heights and in removing baseline noise. 

Panel (f ) shows the result of the adaptive sparse PCA algorithm to be introduced below: 
evidently both goals are accomplished quite satisfactorily in this example. 



2 The need to select subsets: (in)consistency of classical 
PCA 

A basic element of our sparse PCA proposal is initial selection of a relatively small subset 
of the initial p variables before any PCA is attempted. In this section, we formulate some 
(in) consistency results that motivate this initial step. 

Consider first the single component model ([2]). The presence of noise means that the 
sample covariance matrix S = n^^ "121=1 ^i^J '^ill typically have mm{n,p) non-zero eigen- 
values. Let p be the unit eigenvector associated with the largest sample eigenvalue — with 
probability one it is uniquely determined up to sign. 

One natural measure of the closeness of p to p uses the angle Z{p,p) between the two 
vectors. We decree that the signs of p and p be taken so that Z(p, p) lies in [0, vr/2]. It will 
be convenient to phrase the results in terms of an equivalent distance measure 

dist(^, p) = sin Z (p, p) = ^l-{pTpf. (4) 

For asymptotic results, we will assume that there is a sequence of models ([2]) indexed by 
n. Thus, we allow p{n) and p{n) to depend by n, though the dependence will usually not 
be shown explicitly. [Of course a might also be allowed to vary with n, but for simplicity 
it is assumed fixed.] 

Our first interest is whether the estimate p is consistent as n ^ oo. This turns out to 
depend crucially on the limiting value 

lim p{n) jn = c. (5) 

n— >oo 

We will also assume that 

lim ||p(n)|| = ^ > 0. (6) 

n— »oo 

One setting in which this last assumption may be reasonable is when p{n) grows by adding 
finer scale wavelet coefficients of a fixed function as n increases. 
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Theorem 1. Assume model and ([^. Define 

Then with probability one as n ^ oo, 

limsup sinZ(/5, p) < ({g/a,c), (7) 

n^oo 

SO long as the right side is at most one. 

For the proof, see Appendix IA.2[ The bound C{tjc) is decreasing in the "signal-to- 
noise" ratio r = g/a and increasing in the dimension-to-sample size ratio c = limp/n. It 
approaches as c — > 0, and in particular it follows that p is consistent if p/n ^ 0. 

The proof is based on an almost sure bound for eigenvectors of perturbed symmetric 
matrices. It appears to give the correct order of convergence: in the case p/n —> we have 

C{T,p/n) ~ c{T)y^p/n, 

with c(t) = 4r~^ -|- 8t~^, and examination of the proof shows that in fact 

Ap,p) = Op{y/p/^) 

which is consistent with the n~^/^ convergence rate that is typical when p is fixed. 

However if c > 0, the upper bound ([7]) is strictly positive. And it turns out that p must 
be an inconsistent estimate in this setting: 

Theorem 2. Assume model ^ and If p/n ^ c> 0, then p is inconsistent: 

liminf £'Z(p, /)) > 0. 

n— >oo 

In short, /) is a consistent estimate of p if and only if p = o(n). The noise does not 
average out if there are too many dimensions p relative to sample size n. A heuristic 
explanation for this phenomenon is given just before the proof in Appendix IA.3I 

The inconsistency criterion extends to a considerably more general multi- component 
model. Assume that we have n curves Xj, observed at p time points. Viewed as p dimen- 
sional column vectors, this model assumes that 

m 

Xi = p + '^vjp^ + azi, i = l,---,n. (8) 

i=i 

Here ji is the mean function, which is assumed known, and hence is taken to be zero. We 
make the following assumptions: 

(a) The />',j = l,...,m < p are unknown, mutually orthogonal principal components, 
with norms Pj{n) = \\p^\\ 

||p^||>||p'||>--->||/^™||. (9) 

(b) The multipliers ~ -/V(0, 1) are all independent over j = 1, . . . , m and i = 1, . . . ,m. 

(c) The noise vectors Zi ~ Np{0,I) are independent among themselves and also of the 
random effects {vf}. 

For asymptotics, we add 
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(d) We assume that p{n),m{n) and {f>'{n),j = 1,... ,m} are functions of n, though 
this will generally not be shown explicitly. We assume that the norms of the n*^ principal 
components converge as sequences in ^i(N): 

g{n) = {\\pHn)l...,y{n)l...) ^^^^ 
We write for the limiting £i norm: 

j 

Remark on Notation. The index j, which runs over principal components, will be 
written as a superscript on vectors , p' and u-' (defined in Appendix), but as a subscript 
on scalars such as Qj{n) and qj. 

We continue to focus on the estimation of the principal eigenvector p^, and establish a 
more general version of the two preceding theorems. 

Theorem 3. Assume model l^j together with conditions (a) -(d). If p/n ^ c, then 
limsupsinZ(p\ < -^^^^[pj^ + {2 + ^/c)a] 

n^co Qi — Q2 

SO long as the right side is at most, say, 4/5- 
// c > 0, then 

liminf EZ{p^,p^) > 0. 

n— >oo 

Thus, it continues to be true in the multicomponent model that p^ is consistent if and 
only if p = o{n). 



3 The sparse PC A algorithm 

The inconsistency results of Theorems [2] and [3] emphasize the importance of reducing the 
number of variables before embarking on PCA, and motivate the sparse PCA algorithm to 
be described in general terms here. Note that the algorithm per se does not require the 
specification of a particular model, such as ([8]) . 

1. Select Basis. Select a basis {e^} for and compute co-ordinates {xiy) for each xi 
in this basis: 

Xi{t) = ^^Xiyey{t), i = l,...,n. 

V 

[The wavelet basis is used in this paper, for reasons discussed in the next subsection.] 

2. Subset. Calculate the sample variances al = Var{xiy). Let / denote the set of 
indices v corresponding to the largest k variances. 

[k may be specified in advance, or chosen based on the data, see Section [3.21 below] . 

3. Reduced PCA. Apply standard PCA to the reduced data set {xiy, E /, i = 1, . . . , n} 
on the selected /c— dimensional subset, obtaining eigenvectors fP = {pi),j = 1, . . . ,k. 

4. Thresholding. Filter out noise in the estimated eigenvectors by hard thresholding 
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[Hard thresholding is given, as usual, by riH{x,6) = xl{|x| > 5}. An alternative is soft 
thresholding i]s{x, 5) = sgn(2;)(|x| — (5)+, but hard thresholding has been used here because 
it preserves the magnitude of retained signals. 

The threshold 6 can be chosen, for example, by trial and error, or as 5 = fj^/2Togk 
for some estimate fj. In this paper, estimate is used. Another possibility is to set 
fj = MAD{pl, 1/ = 1, . . . , A;}/0.6745. ] 

5. Reconstruction. Return to the original signal domain, setting 

V 

In the rest of this section, we amplify on and illustrate various aspects of this algorithm. 
Given appropriate eigenvalue and eigenvector routines, it is not difficult to code. For 
example, MATLAB files that produce most figures in this paper will soon be available at 



www-stat.staiiford.edu/~imj/ - to exploit wavelet bases, they make use of the open- 



source library WaveLab available at www-stat . Stanford. edu/~wavelab/ 



3.1 Sparsity and Choice of basis 

Suppose that in the basis {ey{t)} a population principal component p{t) has coefficients 

p 

p{t) = ^p^ey{t). 

v=\ 

It is desirable, both from the point of view of economy of representation, as well as 
computational complexity, for the expansion in basis {ey \ to be sparse, i.e., most coefficients 
Piy are small or zero. 

One way to formalize this is to require that the ordered coefficient magnitudes decay at 
some algebraic rate. We say that p is contained in a weak £q ball of radius C, p G wiq{C), 
if |/5|(i) > \p\(2) > ... and 

\p\iu)<Cu-'^'^, u = l,2,... 

Wavelet bases typically provide sparse representations of one-dimensional functions that 
are smooth or have isolated singularities or transient features, such as in our EGG ex- 
ample. Here is one such result. Expand p in a nice wavelet basis {ipjk{t)} to obtain 
p = 'YlijkPjk'^jkii) and then order coefficients by absolute magnitude, so that [p^) is a 
re-ordering of the \pjk\ in decreasing order. Then smoothness (as measured by membership 
in some Besov space B^ ^) implies sparsity in the sense that 

peB^q {pu)£wip, p = 2/{2a + l). 



[for details, see Donoho ( 19931 ) and Johnstone ( 20021 ): in particular it is assumed that 



a > (1/p — 1/2)+ and that the wavelet ■0 is sufficiently smooth.] 

In this paper, we will assume that the basis {ci,} is fixed in advance ~ and it will generally 
be taken to be a wavelet basis. Extension of our results to incorporate basis selection (e.g. 
from a library of orthonormal bases such as wavelet packets) is a natural topic for further 
research. 
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3.2 Adaptive choice of k 

Here are two possibilities for adaptive choice of A; = |/| from the data: 

(a) choose co-ordinates with variance exceeding the estimated noise level by a specified 
fraction On- 

i = {u : al>a\l + an)}. 

This choice is considered further in Section [3.51 

(b) As motivation, recall that we hope that the selected set of variables / is both small in 
cardinality and also captures most of the variance of the population principal components, 
in the sense that the ratio 

is close to one for the leading population principal components in {p^, . . . ,/?'"}. Now let 
^(n) a denote the upper a— percentile of the xfn) distribution - if all co-ordinates were pure 
noise, one might expect cr^^^ to be close to n~^a'^xfn) u/n- Define the excess over these 
percentiles by 

-2 r"2 -1-2 2 n^ 

^(u) = max{fT(^) - n a X{n),u/n^ 0)' 
and for a specified fraction w{n), set 

k 

where k is the smallest index k for which the inequality holds. This second method has 
been used for the figures in this paper, typically with w{n) = .995. 

Estimation of a. If the population principal components have a sparse representation 
in basis {e^}, then we may expect that in most co-ordinates ly, {xi^} will consist largely of 
noise. This suggests a simple estimate of the noise level on the assumption that the noise 
level is the same in all co-ordinates, namely 

(j^ = median((T^). (11) 

3.3 Computational complexity 

It is straightforward to estimate the cost of sparse PCA by examining its main steps: 

1. This depends on the choice of basis. In the wavelet case no more than 0{nplogp) 
operations are needed. 

2. Sort the sample variances and select /: 0{plogp). 

3. Eigendecomposition for a k x k matrix: 0{k^). 

4. Estimate o"^ and ||/5p: 0{p). 

5. Apply thresholding: 0{k). 

6. Reconstruct eigenvectors in the original sample space: 0{k^p). 
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Hence, the total cost of sparse PCA is 



0{nplogp + k^p). 

Both standard and smoothed PCA need at least 0((p A n)^) operations. Therefore, if 
we can find a sparse basis such that k/p ^ 0, then under the assumption that p/n — >■ c as 
n — > oo,the total cost of sparse PCA is o{p^). We will see in examples to follow that the 
savings can be substantial. 

3.4 Simulated examples 

The two examples in this section are both motivated by functional data with localized 
features. 



True PG, p = 2048, n = 1 024, || p || » 25. 



Standard PGA 
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Smoothed PGA, b:10"'% r:10 , m:10 



ASPGA, w = 99.5%, k = 438. 
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Figure 3: Comparison of the sample principal components for a step function. (a) True 

principal component pi = f{l/n), the "step" function (b): Sample principal component by standard 
PCA. (c): Sample principal component by smoothed PCA using A = 10"^^, 10~^ and 10^^. (d): 
Sample principal component by sparse PCA with weighting function w = 99.5%, k — 438. 



The first is a three-peak principal component depicted in Figure[2l and already discussed 
in Section [TJ The second example, Figure O has an underlying first principal component 
composed of step functions. For both examples, the dimension of data vectors is p = 2048, 
the number of observations n = 1024, and the noise level a = 1. However, the amplitudes 
of p differ, with ||yo|| = 10 for the "3-peak" function and ||yo|| ~ 25 for the "step" function. 

Panels (d) and (b) in the two figures respectively show the sample principal components 
obtained by using standard PCA. While standard PCA does capture the peaks and steps, 
it retains significant noise in the flat regions of the function. Corresponding panels (e) and 
(c) show results from smooth PCA with the indicated values of the smoothing parameter. 
Just as for the three peak curve discussed earlier, in the case of the step function, none of 
the three estimates simultaneously captures both jumps and flat regions well. 

Panels (f) and (d) present the principal components obtained by sparse PCA. Using 
method (b) of the previous section with w = 99.5%, the Subset step selects k = 372 and 438 
for the "3-peak" curve and "step" function, respectively. The sample principal component 
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Standard 
PCA 


Smoothed 
A : 10-12 


Smoothed 
A : 10-6 


Sparse 
PCA 


ASE (3-peak) 


9.681e-04 


1.327e-04 


3.627e-2 


7.500e-05 


Time (3-peak) 


~ 12min 


~ 47 min 


~ 43 min 


1 min 15 s 


ASE (step) 


9.715e-04 


3.174e-3 


1.694e-2 


1.947e-04 


Time (step) 


12min 


~ 47 min 


~ 46 min 


1 min 31 s 



Table 1: Accuracy and efficiency comparison 



in Figure [Hd) is clearly superior to the other sample p.c.s in Figure [2j Although the 
principal component function in the step case appears to be only slightly better than the 
solid blue smooth PCA estimate, we will see later that its squared error is reduced by more 
than 90%. 

Table [T] compares the accuracy of the three PCA algorithms, using average squared 
error (ASE) defined as 

ASE = p'^\\p- pf. 

The average ASE over 50 iterations is shown. The running time is the CPU time for a 
single iteration used by Matlab on a MIPS RIOOOO 195.0MHz server. 

Figure [H presents box plots of ASE for the 50 iterations. Sparse PCA gives the best 
result for the "step" curve. For the "3-peak" function, in only a few iterations does sparse 
PCA generate larger error than smoothed PCA with a small A = lO-^^, Qn the average, 
ASE using sparse PCA is superior to the other methods by a large margin. Overall Table [1] 
and Figure [5] show that sparse PCA leads to the most accurate principal component while 
using much less CPU time than other PCA algorithms. 



'3-peak' ASE, 50 iterations 



standard Smddlh Smdoth Sparse 



'Step' ASE, 50 iterations 



standard Smooth Smootti Spar 



Figure 4: Side-by side box-plots of ASE from 50 iterations using different algorithms, (a) For the 
"3-peak" function, (b) For the "step" function. 



Remarks on the single component model. 
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Anderson (1963) obtained the asymptotic distribution of y/n{p — p) for fixed p; in 
particular 

Var{ - pl)} ^ {\\pf + ^')p4 (1 - (12) 



as n — > oo. For us, p increases with n, but we wih nevertheless use (jl2p as an heuristic basis 
for estimating the variance f needed for thresholding. Since the effect of thresholding is to 
remove noise in small coefficients, setting pi, to in (jl2p suggests 



l^ aV\\pr + a^ 



(13) 



Neither WpW^ and in p3p are known, but they can be estimated by using the informa- 
tion contained in the sample covariance matrix S, much as in the discussion of Section [3. 2[ 
Indeed S^, the v-th. diagonal element of S, follows a scaled distribution, with expectation 
p^ + a^. If pu is a sparse representation of /?, then most coefficients will be small, suggesting 
the estimate (llip for a^. In the single component model. 



which suggests as an estimate: 



= ^{52-median(52)}. (14) 
1 

Figure [5] shows the histograms for these estimates of \\p\\ and a based on 100 iterations for 
the "3-peak" curve and for the "step" function. 



3.5 Correct Selection Properties 

A basic issue raised by the sparse PCA algorithm is whether the selected subset / in fact 
correctly contains the largest population variances, and only those. We formulate a result, 
based on large deviations of variables, that provides some reassurance. 

For this section, assume that the diagonal elements of the sample covariance matrix 
S = XixJ have marginal distributions, i.e., 

^1 = S^u ^ (ylxln)/^^ jy = I, (15) 

We will not require any assumptions on the joint distribution of {<7^}. 

Denote the ordered population coordinate variances by o"^^^ > a'^^) ^ • • • and the ordered 

sample coordinate variances by (7^^^ > a'^^) > • • •• A desirable property is that / should, for 
suitable small, 

(i) include all indices / in 

lin = {I ■■ erf > o"^fc)(l + an)}, and 

(ii) exclude all indices I in 

lout = {I ■■ erf < o-Jfc)(l - an)}. 
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(a) 3-peak, estimate for || p || (b) 3-peak, estimate for a 





0.999 0.9995 1 1.0005 1.001 1.0015 
(d) step, estimate for o 




23 24 25 26 1.001 1.002 1.003 1.004 



Figure 5: Histograms from 100 iterations. The "3-peak" function, (a) estimate for \\p\\ = 10: mean 
= 9.91, SD = 0.24. (b): estimate for cr = 1: mean = 1.0005, SD = .0006. The "step" function, (c): 
estimate for ||p|| = 24.82: mean = 24.58, SD = 0.56. (d): estimate for a = 1: mean = 1.0029, SD 
= .0007. 



We will show that this in fact occurs if a„ = 7y^n ^logn, for appropriate 7 > 0. 
We say that a false exclusion (FE) occurs if any variable in /j„ is missed: 

FE=[j {af < ay, 
while a false inclusion (FI) happens if any variable in lout is spuriously selected: 

Theorem 4. Under assumptions M5\). the chance of an inclusion error of either type in 
Ik having magnitude an = 7n~^/^(log n)^/^ is polynomially small: 

P{FE U FI] < 2pA;n-^(^) + fcn-^i-^"")''^, 

with b{-f) = [7\/3/(4 + 2^/3)]2. 

For example, if 7 = 9, then 6(7) = 4.36. As a numerical illustration based on (j54p 
below, if the subset size k = 50, while p = n = 1000, then the chance of an inclusion 
error corresponding to a 25% difference in SDs (i.e. ^/T+'a^ = 1.25) is below 5%. That 
reasonably large sample sizes are needed is a sad fact inherent to variance estimation — as 
one of Tukey's 'anti-hubrisines' puts it, "it takes 300 observations to estimate a variance to 
one significant digit of accuracy" . 

3.6 Consistency 

The sparse PCA algorithm is motivated by the idea that if the p.c.'s have a sparse repre- 
sentation in basis {ci,}, then selection of an appropriate subset of variables should overcome 
the inconsistency problem described by Theorem [2j 
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To show that such a hope is justified, we estabhsh a consistency result for sparse PCA. 
For simplicity, we consider the single component model ([2]), and assume that cx^ is known — 
though this latter assumption could be removed by estimating o"^ using pT|) . 

To select the subset of variables /, we use a version of rule (a) from Section 13.21 

i = {u : al>a\l+jn)}, (16) 

with 7„ = j{n~^ log n)^/^ and 7 a sufficiently large positive constant — for example 7 > VT2 
would work for the proof. 

We assme that the unknown principal components p = p{n) satisfy a uniform sparsity 
condition: for some positive constants q,C, 

p{n) £ wlq{C) uniformly in n. (17) 

Let Pi denote the principal eigenvector estimated by step (3) of the sparse PCA algorithm 
(thresholding is not considered here). 

Theorem 5. Assume that the single component model ^ holds, with p/n ^ c > and 
\\p{n)\\ Q > For each n, assume that p[n) satisfies the uniform sparsity condition 

Then the estimated principal eigenvector pj obtained by subset selection rule U6\) is 
consistent: 

Z(P7,P) -0. 

The proof is given in Appendix lA.5t it is based on a correct selection property similar to 
Theorem U) combined with a modification of the consistency argument for Theorem [3j In 
fact, the proof shows that consistency holds even under the weaker assumption p = 0{n°'), 
for arbitrary a > 0, so long as 7 = 7(a) is set sufficiently large. 

3.7 ECG example 

This section offers a brief illustration of sparse PCA as applied to some ECG data kindly 
provided by Jeffrey Froning and Victor Froelicher in the cardiology group at Palo Alto Vet- 
erans Affairs Hospital. Beat sequences - typically about 60 cycles in length - were obtained 
from some 15 normal patients: we have selected two for the preliminary illustrations here. 

Data Preprocessing. Considerable preprocessing is routinely done on ECG signals 
before the beat averages are produced for physician use. Here we describe certain steps 
taken with our data, in collaboration with Jeff Froning, preparatory to the PCA analysis. 

The most important feature of an ECG signal is the Q-R-S complex: the maximum 
occurs at the R-wave, as depicted in Figure [T]^b) . Therefore we define the length of one 
cycle as the gap between two adjacent maxima of R-waves. 

1. Baseline wander is observed in many ECG data sets, c.f. Figure [6l One common 
remedy for this problem is to deduct a piecewise linear baseline from the signal, the linear 
segment (dashed line) between two beats being determined from two adjacent onset points. 

The onset positions of R-waves are shown by asterisks. Their exact locations vary for 
different patients, and as Figure [6] shows, even for adjacent R-waves. The locations are 
determined manually in this example. To reduce the effect of noise, the values of onset 
points are calculated by an average of 5 points close to the onset position. 

2. Since pulse rates vary even on short time scales, the duration of each heart beat 
cycle may vary as well. We use linear interpolation to equalize the duration of each cycle. 
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Figure 6: ECG baseline wander. 



and for convenience in using wavelet software, discretize to 512 = 2 sample points in each 
cycle. 

3. Finally, due to the importance of the R-wave, the horizontal positions of the maxima 
are the 150th position in each cycle. 

4. Convert the ECG data vector into an n x p data matrix, where n is the number of 
observed cycles and p = 512. Each row of the matrix presents one heart beat cycle with 
the maxima of R-waves all aligned at the same position. 

PCA analysis. Figure [7] (a) and (d) shows the mean curves for two ECG samples in 
blue. The number of observations n, i.e. number of heart beats recorded, are 66 and 61, 
respectively. The first sample principal components for these two sample sets are plotted 
in plots (c) and (f), with red curves from standard PCA and blue curves from sparse PCA. 
In both cases there are two sharp peaks in the vicinity of the QRS complex. The first peak 
occurs shortly before the 150th position, where all the maxima of R-waves are aligned, and 
the second peak, which has an opposite sign, shortly after. 

The standard PCA curve in Figure 6.7.(b, red) is less noisy than that in panel (d, red), 
even allowing for the difference in vertical scales. Using (jllh . 



while the magnitudes of the two mean sample curves are very similar. 

The sparse PCA curves (blue) are smoother than the standard PCA ones (red), espe- 
cially in plot (d) where the signal to noise ratio is lower. On the other hand, the red and 
blue curves match quite well at the two main peaks. Sparse PCA has reduced noise in the 
sample principal component in the baseline while keeping the main features. 

There is a notable difference between the two estimated p.c.'s. In the first case, the p.c. 
is concentrated around the R-wave maximum, and the effect is to accelerate or decelerate 
the rise (and fall) of this peak from baseline in a given cycle. This is more easily seen by 
comparing plots of x + 2p (green) with x — 2p (red), shown over a magnified part of the 
cycle in panel (b). In the second case, the bulk of the energy of the p.c. is concentrated in 
a level shift in the part of the cycle starting with the ST segment. This can be interpreted 
as beat to beat fluctuation in baseline ~ since each beat is anchored at at the onset point, 
there is less fluctuation on the left side of the peak. This is particularly evident in panel 



aj = 24.97 and = 82.12. 
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Figure 7: ECG examples, (a): mean cm've for EGG sample 1, n ~ 66, in blue, along with x + 2p 
(green) and x — 2p (red), with p being the estimated first principal component from sparse PGA 
(see also (c)). (b) Magnified section of (a) over the range 120-220. (c): First principal components 
for sample 1 from standard (red) and sparse PGA (blue), (d)- (f): corresponding plots for sample 
2, n = 61. 



(e) ~ there is again a slight acceleration/deceleration in the rise to the R wave peak - less 
pronounced in the first case, and also less evident in the fall. 

Obvious questions raised by this illustrative example include the nature of effects which 
may have been introduced by the preprocessing steps, notably the baseline removal anchored 
at onset points and the alignment of R-wave maxima. Clearly some conventions must be 
adopted to create rectangular data matrices for p.c. analysis, but detailed analysis of these 
issues must await future work. 

Finally, sparse PCA uses less than 10% of the computing time than standard PCA. 

A Appendix 
A.l Preliminaries 

Matrices. We first recall some pertinent matrix results. Define the 2— norm of a rectan- 
gular matrix by 

PII2 =sup{Px||2 : ||x||2 = 1}. (18) 
If A is real and symmetric, then ||^||2 = ^maxiA). If Apxp is partitioned 
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where b is {p — 1) x 1, then by setting x = {1 0^)'^ in (jlSp . one finds that 

ll&l|2<PI|2. (19) 

The matrix B = pu^ + up^ has at most two non-zero eigenvalues, given by 

\={T±l)\\p\\\\ul r = /n/llp|lll^^ll- (20) 

Indeed, the identity det(I + AC) = d.et{I + CA) for compatible rectangular matrices A and 
C means that the non-zero eigenvalues of 



B={p n)(^^) 



are the same as those of the 2x2 matrix 



T\ / II INI II II Ii2 

u \ / ^ ( t\\p\\ \\u\\ 

IpIP ''"IIpIIII^I 



from which (1201) is immediate. 



Angles between vectors. We recall and develop some elementary facts about angles 
between vectors. The angle between two non-zero vectors i^, r/ in W is defined as 

^(^'^) = TSnrr^^[0'^/2]- (21) 

IIC||2||??||2 

Clearly Z(a(^,brj) = Z{S^,rj) for non-zero scalars a and 6; in fact ^i-,-) is a metric on one- 
dimensional subspaces of MP. If ^ and rj are chosen to be unit vectors with ^^rj > 0, 
then 

||e-7?||2 = 2siniZ(^,7?). (22) 

The sine rule for plane triangles says that if ^, t] are non-zero and linearly independent 
vectors in M^, then 

sinZ(e,r/) = fc|^sinZ(C-r?,r/). (23) 

These remarks can be used to bound the angle between a vector ry and its image under 
a symmetric matrix M in terms of the angle between rj and any principal eigenvector of M. 

Lemma 1. Let ^ he a principal eigenvector of a non-zero symmetric matrix M . For any 

Z(??,Mr?) <3Z(r?,e). 

Proof. We may assume without loss of generality that ||^|| = ||?7|| = 1 and that ^^^r] > 0. 
Since ^ is a principal eigenvector of a symmetric matrix, ||M,^|| = ||M||. From the sine rule 

(Hal), 



sinZ(M^,Mr/) < ||M^ - Mr/||/||M^|| 

< lle-r/ll =2siniZ(e,r/), 

where the final equality uses (p2]) . Some calculus shows that 2sinQ/2 < sin 2a for < a < 
7r/4 and hence 

Z(MC,M7?) <2Z(e,r/). (24) 
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From the triangle inequality on angles, 



Z(?7, Mt]) < Z{rj, Mi) + Z(M^, Mry) 
<3Z(??,e), 

using (j24p and the fact that ^ is an eigenvector of M. 



□ 



Perturbation bounds. Suppose that a symmetric matrix Ap^p has unit eigenvector 
§1. W e wish to bound the effect of a symmetric pertur bation -Epxp on qi . T he following 
result (jGolub &: Van LoanI (Il996l . Thm 8.1.10), see also lStewart fc Sun (Il99nl )^ constructs 
a unit eigenvector qi oi A + E and bounds its distance from qi in terms of ||-E||2- Here, the 
distance between unit eigenvectors qi and qi is defined as at ([4]) and (f2T|) . 

Let Qpxp = [^1 Q2] be an orthogonal matrix containing qi in the first column, and 
partition conformally 



Q^AQ 



A 
D22 



Q^EQ 



T 

e e 
e E22 



where D22 and E22 are both (p — 1) x {p — 1). 

Suppose that A is separated from the rest of the spectrum of A; set 

5 = min |A — 

M6A{D22) 

If \\E\\2 < <J/5, then there exists r G W^'^ satisfying 

||r||2 < (4/<5)||e||2 

such that 

qi = (l + r^r)-i/2(gi+Q2r) 
is a unit eigenvector of A + E. Moreover, 

dist(gi,gi) < (4/5)||e||2. 

Let us remark that since ||e||2 < \\E\\2 by (jl9p . we have ||r||2 < 1 and 



(25) 



gfgi = (l + ||r||2)-i/2>i/^. 



(26) 



Suppose now that qi is the eigenvector of A associated with the principal eigenvalue 
Xi{A). We verify that, under the preceding conditions, qi is also the principal eigenvector 
oi A + E: i.e. if {A + E)qi = A*gi, then in fact A* = Xi{A + E). 

To show this, we verify that A* > \2{A + E). Take inner products with qi in the 
eigenequation for qi: 

X*qJqi=qjAqi + qjEqi. (27) 

Since A is symmetric, qjA = \i{A)qf . Trivially, we have qfEqi > — ||i?||2. Combine these 
remarks with (j26p to get 

X* > Xi{A) - V2\\E\\2. 
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N ow (5 = Ai (A) — A2 ( A) an d since from the minimax charact erization of eigenvalues 
(e.g. iGolub Van LoanI (j 19961 . p. 396) or IStewart k SunI (|l99d . p.218)), X2{A + E) < 
X2{A) + 11-^112, we have 

A* - A2(^ + E)>5-{1 + V2)\\E\\2 

> 6[l - (1 + V2)/5] > 0, 

which is the inequality we seek. 

Large Deviation Inequalities. If X = Yli -^i is the average of i.i.d. variates 
wit h moment generating func tion exp{A(A)} = Eexp{XXi}, then Cramer's theorem (see 
e.g. bembo fc Zeitounil (|l993l . 2.2.2 and 2.2.12)) says that for x > EXi, 



P{X >x}< exp{-nA*(x)}, 



(28) 



where the conjugate function A*(x) = sup;^{Ax — A(A)}. The same bound holds for P{X < 
x} when x < EXi. 

When applied to the xfn) distribution, with Xi = zf and zi ~ A^(0, 1), the m.g.f. 
A(A) = — ^ log(l — 2A) and the conjugate function A*(x) = ^[x — 1 — logx]. The bounds 



log(l + e) < 



e - eV2 
e - 3eV8 



-1< e < 0, 



< e < 



2' 



(the latter following, e.g., from (47) in I Johnstond (I2OOII )) yield 

Pixfn) < ^(1 - e)} < exp{-neV4}, < e < 1, 
Plxfn) > '^(l + e)} < exp{-3neVl6}, < e < i. 

We will use also a slightly sharper bound 

P{xl^)>n + tV2^}<t-^e-''/^ 



(29) 
(30) 

(31) 



valid for n > 16 and < t < n 

V6 djohnstondboOll '). 

When applied to sums of variables Xi = Z1Z2, with zi and Z2 independent A'^(0, 1) 
variates, the m.g.f. A(A) = -^log(l - A^). With A*(x) = [(1 + 4x^)1/2 _ l]/(2x), the 
conjugate function satisfies 

A*(x) = A,x + i log(l - Xl) = (3/2)x2 + 0(x4), 

clS X — > 0. Hence, for n large, 

P{X > ^/bn~nogn} < CrT^^I'^. (32) 

Decomposition of sample covariance matrix. Now adopt the multicomponent 
model (j8]) along with its assumptions (a) - (c). The sample covariance matrix S = 
Xli ^i'^J tiEis expectation £^5 = i? + cr^/p, where 



(33) 
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Now decompose S according to dH]). Introduce Ixn row vectors v^"^ = ('^'i " " " Vn) and collect 
the noise vectors into a matrix Zpxn = [-^i • • • Zn]. We then have 

m m 

S-ES= ^^'' + Y1 + 

j,k=l j=l 

where the p x p matrices 

n 



i=l 



(35) 



Some limit theorems. We turn to properties of the noise matrix Z appearing in ([35]). 
The cross products matrix ZZ'^ has a standard p-dimensional Wisha rt W^jn^ I) distri bution 
with n degrees of freedom and identity covariance matrix, see e.g. iMuirheadI (jl982l . p82). 
Thus the matrix C = {cjk) in (|34p is simply a scaled and recentered Wishart matrix. We 
state results below in terms of either ZZ'^ or C, depending on the subsequent application. 
Properties (b) and (c) especially play a key role in inconsistency when c > 0. 



(a) li p = 0{n), then for any 6 > 8, 



max |cjfc| < cr 



blogn 



n 



a.s. 



as n 



oo. 



(36) 



Cjk) has the 



Proof. We may clearly take a = 1. An off-diagonal term in n~^ZZ'^ — y^^j 
distribution of an i.i.d. average X = n~^^Xj where Xi = z\Z2 is the product of two 
independent standard normal variates. Thus 



P{max \cjk\ > x] < 2p^P{X > x}. 



(37) 



Now apply the large deviation bound (j32|) to the right hand side. Since p ~ cn, the 
Borel-Cantelli lemma suffices to establish ()36p for off-diagonal elements for any b > 2. 



A diagonal term Cjj + 1 in n ^ZZ'^ has the n ^xfn) distribution. Setting t = y ^fclog n 
in ([3T]) yields 

-1/2, 



P{cjj > \Jhn-^ logn} < \/2(61og 



n) 



n 



-6/4 



Since there are p ^ cn diagonal terms, the conclusion (j36p follows (again via Borel-Cantelli) 
so long as 6 > 8. □ 



(b) iGemanI k^?>d \ and lSilversteh] (|l98,5l ) respectively established almost sure limits for 
the largest and smallest eigenvalues of a Wp{n, I) matrix as p/n — > c E [0, oo), from which 
follows: 

Xi{C),Xp{C) ^ a\c±2^). (38) 

[Although the results in the papers cited are for c G (0, oo), the results are easily extended 
to c = by simple coupling arguments.] 
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(c) Suppose in addition that u is a 1 x n vector with independent A^(0, 1) entries, which 
are also independent of Z. Conditioned on u, the vector Zv is distributed as A'p(0, ||f 
Since Z is independent of f , we conclude that 

= X{n)X{p)Up (39) 

where xfn) ^(p) denote chi-square variables and Up a vector uniform on the surface of 
the unit sphere S^^^ in M^, and all three variables are independent. 
Now let Upxi = (jn^^Zv. From (|39p we have 

II ||2 ^ 2 -2 2 2 2 rAn\ 

Ml n X(n)X(p) ^ (7 c, (40) 

as p/n — > c E [0, oo). 

If p is any fixed vector in M^, it follows from (j39p that 

r = t{p) = p^u/IIpIIII^II = Up,i, 

the distribution of the first component of Up. It is well known that Uf ~ Beta(l/2, (p— 1) /2), 
so that EUf = p^^ and VaiUf < 2p~^. From this it follows that 

t{p) 0, p^ oo. (41) 

(d) Let = an^^Zv^ be the vectors appearing in the definition of for 1 < j < m. 
We will show that a.s. 

lim sup I In-' II < cq (42) 

n— >oo j 

(the constant cq = 2a{l + y^) would do). 
Proof. Since 

\\u^f =a\-\^'^Z''Zv^ 

we have 

supllu-'ll^ < a'^n^^XmaxiZZ'^)sup\\v^\'^ /n. (43) 

From (j38p . it follows that w.p. 1, ultimately 

A^,,(ZZ^)/n<2(l + V^)2. (44) 

The squared lengths ||w-'|p follow independent xfn) Since from (|28p there exists 

ci for which ^ 2n} < e~'^^^ for n > ng, it follows that 

P{sup||f^ ||Vf^ > 2} <pe^^i" 

and so w.p. 1 it is ultimately true that 

supllt-^'llVn < 2. (45) 
j 

Substituting (01]) and dUD into (gSD, we recover 1^. □ 
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A. 2 Upper Bounds: Proof of Theorems [T] and [3] 

Instead of working directly with the sample covariance matrix S, we consider S* = S — a'^Ip. 
It is apparent that S* has the same eigenvectors as S. We decompose S* = R + E, where 
R is given by (|33|) and has spectrum 



A(i?)={||pif,---,||p-f,0}. 
The perturbation matrix E = A + B + C, where A and B refer to the sums in ([3 



Proposition 1. Assume that multicomponent model holds, along with assumptions (a) 
- (d). For any e > 0, ifp,n oo, p/n c, then almost surely 

limsup ||£;||2 < a^/c'^gj + a^{c + 2^/c). (46) 
Proof. We will obtain a bound in the form 

ll^lb < En{uj) = An{uj) + Bn{uj) + C„(cj), 

where the An,Bn and C„ will be given below. We have shown explicitly the dependence 
on Lo to emphasize that these quantities are random. Finally we show that the a.s. limit of 
En{Lo) is the right side of (|i6]) . 

A term. Introduce symmetric matrices 2A^^ = A^^ + A^^ = vi^{f>' p^'^ + p'^/>'"^). Since 
fp and p'^ are orthogonal, ()20p implies that 

||i^''=||2<|^f| 11/^11 11/11, 

and so 

II ^^^-^Ib < max|t;f |(^ ll/ll)' =: 

' 3 

The vi^ are entries of a scaled and recentered Wmin, I) matrix, and so by (j36p . the maximum 
converges almost surely to 0. Since \\p^\\ YliBj < oo, it follows that the A„-term 
converges to zero a.s. 

B term. Applying (j20p to the definition (j35p of B^ , we have 



Wh < XnU) = (1 + |r,|)||p^||h^|| '^-4- aVcQj 
where Tj = /9^-^?i-'/||/9^ || Hn-' || and = an^^Zv-', and the convergence follows from (|40p and 

Since |rj| < 1 and using ()42p . we have a.s. that for n > n{u)), 

Xn{j) < ynii) ■■= 2Co||/|| ^ 2co0j. 

The norm convergence (llOp implies that Ylj^nij) — > 2cn Pi an d so it follows from the 
version of the dominated convergence theorem due to Pratt] (|l960l ) that 

^ ||i?^||2 < =: Bniu;) ''^^ aV~cY, B3- 

3 

C term. Using ([38]) . 

Cnioj) = \\C\\2 = X^a.{C) "4- a\c + 2^c). 

□ 
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Proof of Theorem [3] [Theorem [T] is a special case.] We apply the perturbation theorem 
with A = R and E = A -\- B + C . The separation between the principal eigenvalue of R 
and the remaining ones is 

= pl{n) - pl{n) ^ pI- pI, 
while from Proposition [1] we have the bound 

\\E\\2 < En{uj) "-4- a^CQ+ + cj2(c + 2^). 

Consequently, if 
then 

limsup d\si{p^ ^ p^) < i}{p,c;a), 

where 

n{p, c; a) = Aa^/Z[Q+ + a{VZ + 2)]/{qI - el). 



A. 3 Lower Bounds: Proof of Theorem [2] 

We begin with a heuristic outline of the proof. We write S in the form D + B, introducing 

D = {1 + Vs)pp^ + a^n-^ZZ^ , 

while, as before, B = piF + u(?- and u = an~^Zv. 

A symmetry trick plays a major role: write S- = D — B and let />_ be the principal 
unit eigenvector for 5_. 

The argument makes precise the following chain of remarks, which are made plausible 
by reference to Figure [51 




Sx = Dx + Bx = lam x 



Figure 8: Needs caption, with x p, lam ^ A 

(i) Bp is nearly orthogonal to Dp + Bp = Sp = Xp. 

(ii) the side length \\Bp\\ is bounded away from zero, when c > 0. 

(iii) the angle between p and S-p is "large", i.e. bounded away from zero. 

(iv) the angle between p and p- is "large" [this follows from Lemma [T] applied to 
M = 5_.] 

(v) and finally, the angle between p and p must be "large", due to the equality in 
distribution of p and p_. 
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Getting down to details, we will establish (i)-(iii) under the assumption that p is close 
to p. Specifically, we show that given 5 > small, there exists a{5) = q((5;cj, c) > such 
that w.p. 1, 

Z(p,p)<<5 ^ Z(p,5_p) >a(5). 

Let Ns = {x ^ W : Z{x,p) < 5} be the (two-sided) cone of vectors making angle at 
most 5 with x. We show that on A'^^, both 
(ii') ||-Bx|| is bounded below (see (I49p ). and 
(i') Bx is nearly orthogonal to x (see (j50|) . 

For convenience in this proof, we may take \\p\\ = 1. Write x S Ng^ in the form 

X = {cos 5) p + (sin 5)r] , r/ _L p, ||r/|| = 1, < 5 < 5i. (47) 

Since Bp = {u^ p)p + u and Brj = {u'^r])p, we find that 

Bx = {cos 6)u + [{cos 6){u^p) + {sin 6) {u^i])] p. (48) 

Denote the second right side term by r: clearly ||r|| < |n^p| + (sin(5)||n||, and so, uniformly 
on Ns, 

\\Bx\\ > (cos (5 — sin5)||ti|| — {u^ p\. 
Since both ||n|| — > a^/c and u^p a.s., we conclude that w.p. — > 1, 

inf\\Bx\\ > la^/ccos6. (49) 

Ns 

Turning to the angle between x and Bx, we find from (j47p and ()48p that 
x^Bx = 2(cos^ 5){p^u) + 2 cos 5 sm6{u^r]), 
and so, uniformly over A'^^, 

\x'^Bx\ < 2 cos^ 6\p'^u\ + (sin 25) ||ii|| . 

Consequently, using ||x|| = 1 and (H9|) . w.p. 1, and for 6 < it/A, say, 

M 2(TA/csin2(5 ^ , , 

cosZ{Bx,x) = , ' < T<C2<5. 50 

idVccosJ 

Now return to Figure [8j As a prelude to step (iii), we establish a lower bound for 
a = Z{p, Dp). Applying the sine rule (p3|) to ^ = Dp and r] = Xp = Dp + Bp, we obtain 

smZ{Dp,p) = ^^smZ{Bp,p). (51) 
\\Dp\\ 

On the assumption that p £ Ns, bound ([5U|) yields 

sinZ(i?p, /5) > sin(7r/2 — caJ), 

and (j49p implies that 



> ^(Tv^COS (5. 

On the other hand, since \\p\\ = 1, 

\\Dp\\ < \\D\\ <1 + Vs + a^Xmax{n-^ZZ^) 
<[l + a\l + V~c?]{l + o{l)), 
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w.p. 1 for large n. 

Combining the last three bounds into (jSip shows that there exists a positive a{5;a,c) 
such that if p £ Ng, then w.p. 1 for large n, 

sin a > sina{6;a,c) > 0. 

Returning to Figure [HI consider ^(Dp + Bp, Dp — Bp) = + 7. Since (3 > 7r/2 — c^5, 
we clearly have a + 7<7r — /3<7r/2 + c^5 and hence 

^{Dp + ByO, Dp — Bp) = min{a + 7, vr — a — 7} 

> min{a, 7r/2 — C36}. 

In particular, with 6 < 6Q{a,c), 

Z{p, S-p) > min{a(5), 7r/2 — 036} = a{6), 

which is our step (iii). As mentioned earlier, Lemma [1] applied to M = 5"- entails that 
Z(p, p^) > (l/3)a(5). For the rest of the proof, we write p+ for p. To summarize to this 
point, we have shown that if Z{p^,p) < S, then w.p. — > 1, 

Z(p+,p„)>(l/3)a(<5). (52) 

Note that S and have the same distribution: viewed as functions of random terms 
Z and v: 

S.{Z,v) = S+{Z, -v). 
We call an event A symmetric if {Z, v) £ A iS {Z, —v) £ A. For such symmetric events 

E[Zip+,p),A]=E[Z{p^,p),A]. 

From this and the triangle inequality for angles 

Ap+,p) + Ap, P-) > Ap+,p-), 

it follows that 

E[Z{p+,p),A]>^E[Z{p+,p.),A] (53) 

Hence 

E[Z{p+,p)] > E[Z{p+,p),A'] + ^E[Z{p+,p^),A]. 

By the symmetry of the distributions, conclusion (|52|) is also obtained w.p. ^ 1 if 
Z{p^,p) < 5. Consequently, letting A refer to the symmetric event As = {Z{p+,p) < 
6} U {Z(p_,p) < 6}, we have 

E[Z{p+,p)] > dPiA's) + ^E[Z{p+,p^),As] 
> min{(5,a(5)/6}(l +0(1)). 

This completes the proof of Theorem [2j The lower bo und proof for Theorem [3] proceeds 
similarly, but is omitted - for some extra detail, see 



24 



A. 4 Proof of Theorem [H 

We may assume, without loss of generality, that af > a2 > ■ ■ ■ > CTp. 
False inclusion. For any fixed constant t, 

cff >t for i = 1, . . . , A; and (jf <t ^ a'f < ^fk)- 

This threshold device leads to bounds on error probabilities using only marginal distribu- 
tions. For example, consider false inclusion of variable I: 

k 

p{af > < <t}+ n^f > t}- 
1=1 

Write Mn for a xfn)/''^ variate, and note from (fT5]) that al ~ crlMn- Set t = al{l — e„) for 
a value of to be determined. Since af > a\ and af < (T^(1 — a,i), we arrive at 

1 - 1 



P{c^^ > <T(fc)} < kP{Mn <l-en} + p{m„ > 



1 - Q„. 

2 



using large deviation bound f2^ . With the choice e„ = \/3a„/(2 + -v/3), both exponents 
are bounded above by —6(7) logn, and so P{FI} < p{k + l)n~^^'^\ 

False exclusion. The argument is similar, starting with the remark that for any fixed t, 

af <t for i > k,i ^ I and erf > t =^ af > afj^y 
Consequently, if we set t = cr|(l + e„) and use af > af.{l + an), we get 



p{^f < ay < p{^f >t}+ p{^f < t} 



i>k 



<{p- l)P{Mn > 1 + £„} + P{M„ > 

1 + ttn 



< 



n 



this time using ([30]) . 

The bound P{FE} < pA:n-''(T) + /ce-^W(i-2"")i°g" follows on setting e„ = 2a„/(2 + ^/3) 
and noting that (1 + an)~'^ > 1 — 2an . 

For numerical bounds, we may collect the preceding bounds in the form 

P{FEUFI) < [pA; + (p-l)(A;-l)]e-''(^)'°sn+_pe-f'(7)iogn/(i-«n)2 + (^_l)g-fe{7)iogn/(i+a„)2^ 

(54) 

A. 5 Proof of Theorem [5] 

Outline. Recall that 7„ = j{n^^ logn)^/^, that the selected subset of variables / is defined 
by 

I = {i^:af>a\l+jn)} 
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and that the estimated principal eigenvector based on / is written pj. We set 

and win use the triangle inequality d{pi,p) < d{pi,pi) + d{pi,p) to show that pi — > p. 
There are three main steps. 

(i) Construct deterministic sets of indices 

which bracket I almost surely as n — > cxd: 

/- C / C /+ w.p. 1. (55) 

(ii) the uniform sparsity, combined with C is used to show that 

dipi,p) ^' 0. 

(iii) the containment I G 1^, combined with |/^| = o(n) shows via methods similar to 
Theorem H] that 

d{pi,Pi) 0. 

Details. Step (i). We first obtain a bound on the cardinality of using the uniform 
sparsity conditions (fT7|) . Since < Cv~^/'^ 

\lt\<\{^ : C2z.-2/'?>^V7n}|, 

Turning to the bracketing relations (j55p . we first remark that o"^ = (T^X(„)/'^; ^^'^ when 

al = a\l + pl/a^)>a\l + a^jn). 

Using the definitions of I and writing M„ for a random variable with the distribution of 
we have 

< \I-\P{Mn < (l+7n)/(l+a+7n)}. 

We apply (f29l) with = (a+ — l)7n/ (1 + o+Tn) and for n large and 7' slightly smaller than 

nel > (a+ - 1)^7' log n, 

so that 

-Pn < cn^^^exp{-nel/4} < cn^^^'^+ 

with j'l = (a+ - l)^7'/4. If ^ > 12, then 7+ > 3 for suitable a+ > 2. 
The argument for the other inclusion is analogous: 

P+ = P{i ^ /+) < ^ >aHl+ 7n)} 

< pP{Mn > (1 + 7n)/(l + a_7n)} 
// 
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with 7_ = 3(1 — a_)^7'/16 so long as n is large enough. If ^ > 12, then 7" > 2 for 
suitable a_ < 1 — -y/8/9. 

By a Borel-Cantelli argument, (|55p follows from the bounds on and P^. 

Step (ii). For n > n{u;) we have I~ C I and so 

When G I^'^, we have by definition 

pl{n) < cr^a+7\/n-ilogn := e^, 
say, while the uniform sparsity condition entails 

\p\h < c'^-'^"- 

Putting these together, and defining = s*(n) as the solution of the equation Cs^^^'' = 
€n, we obtain 

V 
V 

= [2/(2 - q)]C'^el-'^ - 

as n — > 00. 



Step (iii). We adopt the abbreviations 

ui = {ui, : V e /), 

Zi = {zyi : 1^ £ i,i = 1, . . . ,n), 

Si = {Si,yi : v,v' e i). 

As in the proof of Theorem [3l we consider Sj = Sj — cr'^If, = Pipf + Ei and note that the 
perturbation term has the decomposition 

Ei = VsPipJ + piuj + uipj + a^{n~^ZiZj - I), 

so that 

||£^7||2 < ^^sll/O/lli + 2||/0/||2|k/||2 + (7'^[Xmax{n~^ZiZf) - 1]. 



Consider the first term on the right side. Since — p||2 from step (ii), it follows 
that II/O/II2 \\p\\- As before Vg 0, and so the first term is asymptotically negligible. 
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Let Zi+ = [zyi : v G Ini'^ — ^■.■■■■,n) and Ui+ = {u^ : v G I^). On the event 
f^n = {-^ C In}-, we have 

11^/11 < 11^7+ II 

and setting kj^ = |/^|, by the same arguments as led to (j40p . we have 

\\ui+f = a'^{k+/n){xln)/'i^){xlk+)/K) 0, 

since k+ = o{n) from step (i). 

Finahy, since on the event fin, the matrix Zj+ contains Zj, along with some additional 
rows, it follows that 

Xmax{n~'^ZjZj - I) < \max{n'^Zj+Zf+ - I) 

by ([38]) . again since k+ = o(n). Combining the previous bounds, we conclude that ||-E'/||2 
0. 

The separation 5n = \\pi\\2 ~^ ll/'lli > ^ and so by the perturbation bound 

dist{pi,pi) < (4/,5n)||£^/||2"-^-0. 
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